Creating a tagset, lexicon and guesser for a French tagger* 
Abstract 

We earlier described two taggers for 
French, a statistical one and a constraint- 
based one. The two taggers have the same 
tokeniser and morphological analyser. In 
this paper, we describe aspects of this work 
concerned with the definition of the tagset, 
the building of the lexicon, derived from an 
existing two-level morphological analyser, 
and the definition of a lexical transducer 
for guessing unknown words. 



1 Background 

We earlier described two taggers for French: the
statistical one has an accuracy of 95-97 % and the
constraint-based one 97-99 % (see Chanod and
Tapanainen, 1994; Chanod and Tapanainen, 1995).
The disambiguation has already been described, and
here we discuss the other stages of the process,
namely the definition of the tagset, the transformation
of an existing lexicon into a new one and the guessing
of words that do not appear in the lexicon. 

Our lexicon is based on a finite-state transducer 
lexicon (Karttunen et al., 1992). The French 
description was originally built by Annie Zaenen 
and Carol Neidle, and later refined by Jean-Pierre 
Chanod (1994). 

Related work on French can be found in
(Derouault, 1985). 



2 Tagset 

We describe in this section criteria for selecting the 
tagset. The following is based on what we noticed 
to be useful while developing the taggers. 



* Presented at the ACL SIGDAT workshop on From 
Texts To Tags: Issues In Multilingual Language
Analysis, pages 58-64. University College Dublin,
Ireland, 1995. 



2.1 The size of the tagset 

Our basic French morphological analyser was not 
originally designed for a (statistical) tagger and the 
number of different tag combinations it has is quite 
high. The size of the tagset is only 88. But because a 
word is typically associated with a sequence of tags, 
the number of different combinations is higher, 353 
possible sequences for single French words. If we also 
consider words joined with clitics, the number of dif- 
ferent combinations is much higher, namely 6525. 

A big tagset does not cause trouble for a 
constraint-based tagger because one can refer to a 
combination of tags as easily as to a single tag. For 
a statistical tagger however, a big tagset may be a 
major problem. We therefore used two principles for 
forming the tagset: (1) the tagset should not be big 
and (2) the tagset should not introduce distinctions 
that cannot be resolved at this level of analysis. 

2.2 Verb tense and mood 

As distinctions that cannot be resolved at this level 
of analysis should be avoided, we do not have infor- 
mation about the tense of the verbs. Some of this 
information can be recovered later by performing an- 
other lexicon lookup after the analysis. Thus, if the 
verb tense is not ambiguous, we have not lost any in- 
formation and, even if it is, a part-of-speech tagger 
could not resolve the ambiguity very reliably any- 
way. For instance, dort (present; sleeps) and dormira 
(future; will sleep) have the same tag VERB-SG-P3, 
because they are both singular, third-person forms 
and they can both be the main verb of a clause. If 
needed, we can do another lexicon lookup for words 
that have the tag VERB-SG-P3 and assign a tense 
to them after the disambiguation. Therefore, the 
tagset and the lexicon together may make finer dis- 
tinctions than the tagger alone. 
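
The second lookup described above can be pictured
with a minimal sketch. It is an illustration only:
the table FULL_LEXICON, the function recover_tenses
and the tense labels are hypothetical names
introduced here, and a plain dictionary stands in
for the lexical transducer. Only the word forms and
the tag VERB-SG-P3 come from the text.

```python
# Hypothetical fine-grained lexicon, consulted only AFTER disambiguation.
# Keys are (surface form, coarse tag); values are the compatible tenses.
FULL_LEXICON = {
    ("dort", "VERB-SG-P3"): ["present"],
    ("dormira", "VERB-SG-P3"): ["future"],
}


def recover_tenses(word, tag):
    """Return the tenses compatible with a disambiguated (word, tag) pair.

    An empty list means the sketch lexicon has no entry for the pair.
    """
    return FULL_LEXICON.get((word, tag), [])


# Both forms share one coarse tag, yet the second lookup separates them.
for form in ("dort", "dormira"):
    print(form, "VERB-SG-P3", recover_tenses(form, "VERB-SG-P3"))
```

When the returned list has a single element, no
information was lost by omitting tense from the
tagset; when it has several, the ambiguity is one
that the tagger could not have resolved reliably.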

On the other hand, the same verb form dit can be 
third person singular present indicative or third
person singular past historic (passé simple) of the
verb dire (to say). We do not introduce the distinction 
between those two forms, both tagged as VERB-SG- 
P3, because determining which of the two tenses is to 
be selected in a given context goes beyond the scope 
of the tagger. However, we do keep the distinction 
between dit as a finite verb (present or past) on one 
side and as a past participle on the other, because 
this distinction is properly handled with a limited 
contextual analysis. 

Morphological information concerning mood is 
also collapsed in the same way, so that a large class 
of ambiguity between present indicative and present 
subjunctive is not resolved: again this is motivated 
by the fact that the mood is determined by remote 
elements such as, among others, connectors that can 
be located at (theoretically) any distance from the 
verb. For instance, a conjunction like quoique re- 
quires the subjunctive mood: 

Quoique, en principe, ce cas soit fréquent. 
(Though, in principle, this case is [subjunctive]
frequent.) 

The polarity of the main verb to which a sub- 
ordinate clause is attached also plays a role. For 
instance, compare: 

Je pense que les petits enfants font de jolis
dessins. (I think that small kids make 
[indicative] nice drawings.) 

Je ne pense pas que les petits enfants 
fassent de jolis dessins. (I do not think 
that small kids make [subjunctive] nice 
drawings.) 



Consequently, forms like chante are tagged as 
VERB-P3SG regardless of their mood. In the case 
of faire (to do, to make) however, the mood
information can easily be recovered, as the third
person plural forms are font and fassent for the
indicative and subjunctive moods respectively. 

2.3 Person 

The person seems to be problematic for a statisti- 
cal tagger (but not for a constraint-based tagger). 
For instance, the verb pense, ambiguous between the 
first- and third-person, in the sentence Je ne le pense 
pas (I do not think so) is disambiguated wrongly 
because the statistical tagger fails to see the
first-person pronoun je and selects the more common
third-person reading for the verb. 

We chose to collapse the first- and second-person
verbs together but not the third person. The 
reason why we cannot also collapse the third person
is that we have an ambiguity class that contains 
adjective and first- or second-person verbs. In a 
sentence like Le secteur matières (NOUN-PL)
plastiques (ADJ-PL/NOUN-PL/VERB-P1P2). . . the 
verb reading for plastiques is impossible. Because 
a noun followed by a third-person verb is a
relatively common sequence, collapsing the third
person as well would cause trouble in parsing. 

Because we use the same tag for first- and second- 
person verbs, the first- and second-person pronouns 
are also collapsed together to keep the system con- 
sistent. Determining the person after the analysis 
is also quite straightforward: the personal pronouns 
are not ambiguous, and the verb form, if it is am- 
biguous, can be recovered from its subject pronoun. 
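
As a rough illustration of this recovery step, the
sketch below looks up the person of a collapsed
verb tag from its subject pronoun. The table
PRONOUN_PERSON and the function recover_person are
hypothetical names introduced for this example; the
sketch also assumes that the subject pronoun has
already been identified, which the text does not
describe.

```python
# Hypothetical table: unambiguous subject pronouns and their person.
PRONOUN_PERSON = {"je": 1, "j'": 1, "nous": 1, "tu": 2, "vous": 2}


def recover_person(subject_pronoun, verb_tag):
    """Return 1 or 2 for a collapsed VERB-P1P2 tag, else None.

    The person is read off the subject pronoun; other verb tags
    (for example third-person ones) need no recovery.
    """
    if verb_tag != "VERB-P1P2":
        return None
    return PRONOUN_PERSON.get(subject_pronoun.lower())


print(recover_person("Je", "VERB-P1P2"))
print(recover_person("tu", "VERB-P1P2"))
```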

2.4 Lexical word-form 

Surface forms were also collapsed into a single
lexical item when they can be attached to different 
lemmata (lexical forms) while sharing the same
category, such as peignent, derived from the verb
peigner (to comb) or peindre (to paint). Such
coincidental situations are very rare in French
(El-Bèze, 1993). However, in the case of suis, the
first person singular of the auxiliary être (to be)
or of the verb suivre (to follow), the distinction
is maintained, as we introduced special tags for
auxiliaries. 

2.5 Gender and number 

We have not introduced gender distinctions as far as 
nouns and adjectives (and incidentally determiners) 
are concerned. Thus a feminine noun like chaise 
(chair) and a masculine noun like tabouret (stool) 
both receive the same tag NOUN-SG. 

However, we have introduced distinctions between 
singular nouns (NOUN-SG), plural nouns (NOUN-PL)
and number-invariant nouns (NOUN-INV) such 
as taux (rate/rates). Similar distinctions apply to 
adjectives and determiners. The main reason for 
this choice is that number, unlike gender, plays a 
major role in French with respect to subject/verb 
agreement, and the noun/verb ambiguity is one of 
the major cases that we want the tagger to resolve. 

2.6 Discussion on Gender 

Ignoring the gender distinction in a French tagger
is certainly counterintuitive. There are three major 
objections against this choice: 

• Gender information would provide better
disambiguation, 

• Gender-ambiguous nouns should be resolved, 
and 

• Displaying gender provides more information. 

There is obviously a strong objection against leav- 
ing out gender information as this information may 
provide a better disambiguation in some contexts. 
For instance in le diffuseur diffuse, the word diffuse 
is ambiguous as a verb or as a feminine adjective. 
This last category is unlikely after a masculine noun 
like diffuseur. 

However, one may observe that gender agreement 
between nouns and adjectives often involves
long-distance dependencies, due for instance to
coordination or to the adjunction of noun
complements as in 
une envie de soleil diffuse where the feminine adjec- 
tive diffuse agrees with the feminine noun envie. In 
other words, introducing linguistically relevant in- 
formation such as gender into the tagset is fine, but 
if this information is not used in the linguistically 
relevant context, the benefit is unclear. Therefore, 
if a (statistical) tagger is not able to use the relevant 
context, it may produce some extra errors by using 
the gender. 

A minor but interesting benefit of not introducing
the gender distinction is that there is then no 
problem with tagging phrases like mon allusion (my 
allusion), where the masculine form of the possessive 
determiner mon precedes, for euphonic reasons, a
feminine singular noun that begins with a vowel. 

Our position is that situations where the gen- 
der distinction would help are rare, and that the 
expected improvement could well be impaired by 
new errors in some other contexts. On a test suite 
(Chanod and Tapanainen, 1995) extracted from the 
newspaper Le Monde (12 000 words) tagged with 
either of our two taggers, we counted only three 
errors that violated gender agreement. Two could 
have been avoided by other means, i.e. they belong 
to other classes of tagging errors. The problematic 
sentence was: 

L'armée interdit d'autre part le passage. . . 
(The army forbids the passage. . . ) 

where interdit is mistakenly tagged as an adjective 
rather than a finite verb, while armée is a feminine
noun and interdit a masculine adjective, which 
makes the noun-adjective sequence impossible in 
this particular sentence^1. 

1 We have not systematically compared the two
approaches, i.e. with or without gender distinction, but 
previous experience (Chanod, 1993) with broad-coverage
parsing of possibly erroneous texts has shown that 
gender agreement is not as essential as one may think 
when it comes to French parsing. 

Another argument in favour of gender distinction
is that some nouns are ambiguously masculine 
or feminine, with possible differences in meaning, 
e.g. poste, garde, manche, tour, page. A tagger that 
carried this distinction would then provide 
sense disambiguation for such words. 

Actually, such gender-ambiguous words are not 
very frequent. On the same 12 000-word test corpus, 
we counted 46 occurrences of words which have dif- 
ferent meanings for the masculine and the feminine 
noun readings. This number could be further re- 
duced if extremely rare readings were removed from 
the lexicon, like masculine ombre (a kind of fish while 
the feminine reading means shadow or shade) or fem- 
inine litre (a religious ornament). We also counted 
325 occurrences of nouns (proper nouns excluded) 
which do not have different meanings in the masculine
and the feminine readings, e.g. élève, camarade, 
jeune. 

A reason not to distinguish the gender of such 
nouns, besides their sparsity, is that the immediate 
context does not always suffice to resolve the ambi- 
guity. Basically, disambiguation is possible if there is 
an unambiguous masculine or feminine modifier at- 
tached to the noun as in le poste vs. la poste. This is 
often not the case, especially for preposition + noun 
sequences and for plural forms, as plural determin- 
ers themselves are often ambiguous with respect to 
gender. For instance, in our test corpus, we find
expressions like en 225 pages, à leur tour, à ces postes 
and pour les postes de responsabilité, for which the 
contextual analysis does not help to disambiguate 
the gender of the head noun. 

Finally, carrying the gender information does not 
itself increase the disambiguation power of the
tagger. A disambiguator that explicitly marked 
gender distinctions in the tagset would not
necessarily provide more information. A reasonable way 
to assess the disambiguating power of a tagger is 
to consider the ratio of the initial number of 
ambiguous tags to the final number of tags after 
disambiguation. For instance, it does not make any 
difference if the ambiguity class for a word like table 
is [feminine-noun, finite-verb] or [noun, finite-verb]: 
in both cases the tagger reduces the ambiguity by a 
ratio of 2 to 1. The information that can be derived 
from this disambiguation is a matter of associating 
the tagged word with any relevant information like 
its base form, morphological features such as gen- 
der, or even its definition or its translation into some 
other language. This can be achieved by looking up 
the disambiguated word in the appropriate lexicon. 
Providing this derived information is not an intrinsic 
property of the tagger. 
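
The ratio used in this argument can be written out
as a short sketch. The function name
reduction_ratio is introduced here for illustration
only; the two ambiguity classes for table are the
ones given in the text.

```python
def reduction_ratio(ambiguity_class, remaining_tags):
    """Ratio of initial candidate tags to tags left after disambiguation."""
    return len(ambiguity_class) / len(remaining_tags)


# The two alternative ambiguity classes for the word "table".
with_gender = ["feminine-noun", "finite-verb"]
without_gender = ["noun", "finite-verb"]

# In both cases the tagger keeps one tag out of two.
print(reduction_ratio(with_gender, ["feminine-noun"]))
print(reduction_ratio(without_gender, ["noun"]))
```

Both calls yield the same ratio, 2 to 1, which is
the sense in which marking gender in the tagset does
not by itself add disambiguating power.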

Our point is that the objections do not hold very 
strongly. Gender information is certainly important 
in itself. We only argue that ignoring it at the level 
of part-of-speech tagging has no measurable effect 
on the overall quality of the tagger. On our test cor- 
pus of 12 000 words, only three errors violate gender 
agreement. This indicates how little the accuracy of 
the tagger could be improved by introducing gender 
distinction. On the other hand, we do not know how 
many errors would have been introduced if we had 
distinguished between the genders. 

2.7 Remaining categories 

We avoid categories that are too small, i.e. rare 
words that do not fit into an existing category are 
collapsed together. Making a distinction between 
categories is not useful if there are not enough oc- 
currences of them in the training sample. We made 
a category MISC for all those miscellaneous words 
that do not fit into any existing category. This
accounts for words such as the interjection oh, the
salutation bonjour, the onomatopoeia miaou, and word
parts, i.e. words that only exist as part of a
multi-word expression, such as priori in a priori. 

2.8 Dividing a category 

In a few instances, we introduced new categories for 
words that have a specific syntactic distribution. For 
instance, we introduced a word-specific tag PREP-DE
for the words de, des and du, and a tag PREP-A for
the words à, au and aux. Word-specific tags for other 
prepositions could be considered too. The other 
readings of the words were not removed, e.g. de is, 
ambiguously, still a determiner as well as PREP-DE. 

When we have only one tag for all the preposi- 
tions, for example, a sequence like 

determiner noun noun/verb preposition 

is frequently disambiguated in the wrong way by the 
statistical tagger, e.g. Le train part à cinq heures 
(The train leaves at 5 o'clock). The word part is
ambiguous between a noun and a verb (singular, third 
person), and the tagger seems to prefer the noun 
reading between a singular noun and a preposition. 

We succeeded in fixing this without modifying the 
tagset, but the side effect was that overall accuracy 
deteriorated. The main problem is that the
preposition de, comparable to English of, is the most 
common preposition and also has a specific
distribution. When we added new tags, say PREP-DE 
and PREP-A, for the specific prepositions while the 
other prepositions remained marked with PREP, we 
got the correct result, with no noticeable change in 
overall accuracy. 
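
The division of the preposition category can be
sketched as a simple refinement of the preposition
reading. The function preposition_tag and the two
word sets are hypothetical names used for
illustration. The sketch only refines the
preposition reading of a word and leaves its other
readings (such as the determiner reading of de)
untouched.

```python
# Word-specific preposition tags, following the word lists in the text.
PREP_DE_WORDS = {"de", "des", "du"}
PREP_A_WORDS = {"à", "au", "aux"}


def preposition_tag(word):
    """Return the tag to use for the preposition reading of `word`."""
    lowered = word.lower()
    if lowered in PREP_DE_WORDS:
        return "PREP-DE"
    if lowered in PREP_A_WORDS:
        return "PREP-A"
    # All other prepositions keep the general tag.
    return "PREP"


for word in ("de", "aux", "sur"):
    print(word, preposition_tag(word))
```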



3 Building the lexicon 



Our starting point is a lexical transducer for French
(Karttunen et al., 1992), which was built using Xerox
Lexical Tools (Karttunen and Beesley, 1992;
Karttunen, 1993). In our work we do not modify the
corresponding source lexicon but we employ our
finite-state calculus to map the lexical transducer
into a new one. Writing rules that map a tag or a
sequence of tags into a new tag is rather
straightforward, but redefining the source lexicon
would imply complex and time-consuming work. 

The initial lexicon contains all the inflectional
information. For instance, the word danses (the plural 
of the noun danse or a second person form of the verb 
danser, to dance) has the following analyses^2: 

danser +IndP +SG +P2 +Verb 
danser +SubjP +SG +P2 +Verb 
danse +Fem +PL +Noun 

Forms that include clitics are analysed as a se- 
quence of items separated by the symbols < or > 
depending on whether the clitics precede or follow 
the head word. For instance vient-il (does he come, 
lit. comes-he) is analysed as^3: 

venir +IndP +SG +P3 +Verb 

> il +Nom +Masc +SG +P3 +PC 

From this basic morphological transducer, we de- 
rived a new lexicon that matches the reduced tagset 
described above. This involved two major opera- 
tions: 

• handling cliticised forms appropriately for the 
tagger's needs, and 

• switching tagsets. 

In order to reduce the number of tags, cliticised 
items (like vient-il) are split into independent tokens 
for the tagging application. This splitting is per- 
formed at an early stage by the tokeniser, before 
dictionary lookup. Keeping track of the fact that the 
tokens were initially agglutinated reduces the overall 
ambiguity. For instance, if the word danses is derived
from the expression danses-tu (do you dance, 
lit. dance-you), then it can only be a verb reading.
This is why forms like danses-tu are tokenised 
as danses- and tu, and forms like chante-t-il are to- 
kenised as chante-t- and il. This in turn requires 
that forms like danses- and chante-t- be introduced 
into the new lexicon. 
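
The splitting convention can be illustrated with a
minimal sketch that uses a regular expression
rather than the finite-state tokeniser itself. The
clitic list POSTVERBAL_CLITICS and the function
split_clitic are assumptions made for this example;
the list is not claimed to be complete, and a real
tokeniser would also have to protect lexicalised
hyphenated words.

```python
import re

# Hypothetical, incomplete list of subject clitics that may follow a verb.
POSTVERBAL_CLITICS = ("je", "tu", "il", "elle", "on",
                      "nous", "vous", "ils", "elles")

# Greedy "(.+-)" keeps every hyphen (and the euphonic -t-) on the verb part.
_CLITIC_RE = re.compile(
    r"^(.+-)(" + "|".join(POSTVERBAL_CLITICS) + r")$", re.IGNORECASE
)


def split_clitic(token):
    """Split 'danses-tu' into ['danses-', 'tu']; leave other tokens whole."""
    match = _CLITIC_RE.match(token)
    if match:
        return [match.group(1), match.group(2)]
    return [token]


for token in ("danses-tu", "chante-t-il", "vient-il", "peut-être"):
    print(token, split_clitic(token))
```

Keeping the trailing hyphen on the first token is
what records that the form was agglutinated, so
that danses- can be listed in the lexicon with a
verb reading only.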

2 The tags represent: present indicative, singular,
second person, verb; present subjunctive, singular,
second person, verb; and feminine, plural, noun. 

3 The tags for il represent: nominative, masculine, 
singular, third person, clitic pronoun. 



With respect to switching tagsets, we use contextual
two-level rules that turn the initial tags into 
new tags, or into the void symbol if old tags must 
simply disappear. For instance, the symbol +Verb 
is transformed into +VERB-P3SG if the immediate 
left context consists of the symbols +SG +P3. The 
symbols +IndP, +SG and +P3 are then transduced 
to the void symbol, so that vient (or even the new 
token vient-) gets analysed merely as +VERB-P3SG 
instead of +IndP +SG +P3 +Verb. 
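
The effect of this rule on a single analysis can be
imitated with ordinary list operations. The sketch
below is not a two-level rule compiler: it
hard-codes the one rule given in the text, and the
names VOIDED and switch_tags are hypothetical.
Treating +SubjP as voided alongside +IndP is an
assumption based on the collapsing of mood
described in Section 2.2.

```python
# Symbols mapped to the void symbol once the +Verb rule has applied.
# "+SubjP" is an assumption (mood is collapsed in the reduced tagset).
VOIDED = {"+IndP", "+SubjP", "+SG", "+P3"}


def switch_tags(tags):
    """Rewrite one analysis from the old tagset to the reduced one.

    Only the rule from the text is covered: +Verb becomes +VERB-P3SG
    when its immediate left context is +SG +P3; the mood, number and
    person symbols of that analysis are then dropped.
    """
    if tags[-3:] == ["+SG", "+P3", "+Verb"]:
        kept = [tag for tag in tags[:-1] if tag not in VOIDED]
        return kept + ["+VERB-P3SG"]
    # Analyses not matched by the rule are returned unchanged.
    return list(tags)


print(switch_tags(["+IndP", "+SG", "+P3", "+Verb"]))
print(switch_tags(["+Fem", "+PL", "+Noun"]))
```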

A final transformation consists in associating a 
given surface form with its ambiguity class, i.e. with 
the alphabetically ordered sequence of all its possi- 
ble tags. For instance danses is associated with the 
ambiguity class [+NOUN-PL +VERB-P1P2], i.e. it 
is either a plural noun or a verb form that belongs 
to the collapsed first or second person paradigm. 
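
Building an ambiguity class amounts to removing
duplicate tags and sorting the rest. A minimal
sketch, in which the function name ambiguity_class
is introduced for illustration and the input list
stands for the reduced tags of the three analyses
of danses:

```python
def ambiguity_class(tags):
    """Return the alphabetically ordered list of distinct tags."""
    return sorted(set(tags))


# The indicative and subjunctive verb analyses of "danses" collapse into
# the same reduced tag, so the duplicate disappears.
reduced_tags = ["+VERB-P1P2", "+VERB-P1P2", "+NOUN-PL"]
print(ambiguity_class(reduced_tags))
```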

4 Guesser 

Words not found in the lexicon are analysed by a 
separate finite-state transducer, the guesser. We 
developed a simple, extremely compact and effi- 
cient guesser for French. It is based on the general 
assumption that neologisms and uncommon words 
tend to follow regular inflectional patterns. 

The guesser is thus based on productive endings 
(like ment for adverbs, ible for adjectives, er for 
verbs). A given ending may point to various cat- 
egories, e.g. er identifies not only infinitive verbs 
but also nouns, due to possible borrowings from En- 
glish. For instance, the ambiguity class for killer is 
[NOUN-SG VERB-INF]. 

These endings belong to the most frequent end- 
ing patterns in the lexicon, where every rare word 
weighs as much as any frequent word. Endings 
are not selected according to their frequency in run- 
ning texts, because highly frequent words tend to 
have irregular endings, as shown by adverbs like
jamais, toujours, peut-être, hier, souvent (never,
always, maybe. . .). 

Similarly, verb neologisms belong to the regular 
conjugation paradigm characterised by the infinitive 
ending er, e.g. deballaduriser. 

With respect to nouns, we first selected productive 
endings (iste, eau, eur, rice. . .), until we realised a 
better choice was to assign a noun tag to all endings, 
with the exception of those previously assigned to 
other classes. In the latter case, two situations may 
arise: either the ending is shared between nouns and 
some other category (such as ment), or it must be 
barred from the list of noun endings (such as aient, 
an inflectional marking of third person plural verbs). 
We in fact introduced some hierarchy into the
endings: e.g. ment is shared by adverbs and nouns,
while iquement is assigned to adverbs only. 
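
The hierarchy of endings can be approximated by
trying the longest ending first. The sketch below
is a plain dictionary lookup, not the finite-state
guesser. The table ENDINGS contains only the
endings mentioned in the text; the tag names ADJ-SG
and VERB-P3PL and the function guess are
assumptions made for this example.

```python
# Ending -> ambiguity class. ADJ-SG and VERB-P3PL are hypothetical tag
# names; the other tags appear in the text.
ENDINGS = {
    "iquement": ["ADV"],              # longer ending: adverbs only
    "ment": ["ADV", "NOUN-SG"],       # shared by adverbs and nouns
    "aient": ["VERB-P3PL"],           # barred from the noun endings
    "ible": ["ADJ-SG"],
    "er": ["NOUN-SG", "VERB-INF"],    # infinitives and English borrowings
}

# Every ending not listed above is treated as a noun ending.
DEFAULT_CLASS = ["NOUN-SG"]


def guess(word):
    """Return the ambiguity class of an unknown, non-capitalised word."""
    lowered = word.lower()
    # Longest endings first, so that 'iquement' takes priority over 'ment'.
    for ending in sorted(ENDINGS, key=len, reverse=True):
        if lowered.endswith(ending):
            return ENDINGS[ending]
    return DEFAULT_CLASS


for word in ("killer", "automatiquement", "lentement", "appellaient"):
    print(word, guess(word))
```

Ordering the endings by length is one simple way to
obtain the hierarchy; it is chosen here for brevity
and is not claimed to be how the transducer encodes
it.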

Guessing based on endings offers some side ad- 
vantages: unknown words often result from alterna- 
tions, which occur at the beginning of the word, the 
rest remaining the same, e.g. derivational prefixes as 
in israelo-jordano-palestinienne but also oral tran- 
scriptions such as les z'oreilles (the ears), with z' 
marking the phonological liaison. Similarly, spelling 
errors which account for many of the unknown words 
actually affect the ending less than the internal 
structure of the word, e.g. the misspelt verb forms 
appellaient, geulait. Hyphens used to emphasise a 
word, e.g. har-mo-ni-ser, also leave endings unal- 
tered. Those side advantages do not however operate 
when the alternation (prefix, spelling error) applies 
to a frequent word that does not follow regular
ending patterns. For instance, the verb construit and 
the adverb très are respectively misspelt as constuit 
and trés, and are not properly recognised. 

Generally, the guesser does not recognise words 
belonging to closed classes (conjunctions, preposi- 
tions, etc.) under the assumption that closed classes 
are fully described in the basic lexicon. A possible 
improvement to the guesser would be to incorporate 
frequent spelling errors for words that are not oth- 
erwise recognised. 

4.1 Testing the guesser 

We extracted, from a corpus of newspaper articles
(Libération), a list of 13 500 words unknown 
to the basic lexicon^4. Of those unknown words, 
9385 (i.e. about 70 %) are capitalised words, which 
are correctly and unambiguously analysed by the 
guesser as proper nouns with more than 95 % ac- 
curacy. Errors are mostly due to foreign capitalised 
words which are not proper nouns (such as Eight) 
and onomatopoeia (such as Ooooh). 

The test on the remaining 4000 non-capitalised 
unknown words is more interesting. We randomly 
selected 800 of these words and ran the guesser on 
them. 1192 tags were assigned to those 800 words 
by the guesser, which gives an average of 1.5 tags 
per word. 

For 113 words, at least one required tag was missing
(118 tags were missing in total; 4 words were 
lacking more than one tag: they are misspelt
irregular verbs that have not been recognised as such). 
This means that 86 % of the words got all the re- 
quired tags from the guesser. 

273 of the 1192 tags were classified as irrelevant. 
This concerned 244 words, which means that 70 % 
of the words did not get any irrelevant tags. Finally, 
63 % of the words got all the required tags and only 
those. 

4 On various large newspaper corpora, an average of 
18 % of words are unknown: this is mostly due to the
high frequency of proper nouns. 

If we combine the evaluation on capitalised and 
non-capitalised words, 85 % of all unknown words 
are perfectly tagged by the guesser, and 92 % get 
all the necessary tags (with possibly some unwanted 
ones). 
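
The figures in this section can be cross-checked
with a few lines of arithmetic. The sketch rests on
two assumptions that the text does not state
explicitly: the capitalised words are credited with
exactly 95 % accuracy (the text says "more than
95 %"), and the combined scores are averages
weighted by the number of capitalised and
non-capitalised words in the list of 13 500.

```python
total_unknown = 13500
capitalised = 9385
non_capitalised = total_unknown - capitalised  # rounded to 4000 in the text

# Sample of non-capitalised words.
sample_size = 800
tags_assigned = 1192
words_missing_tag = 113
words_with_irrelevant_tag = 244

avg_tags_per_word = tags_assigned / sample_size
all_required = 1 - words_missing_tag / sample_size
no_irrelevant = (sample_size - words_with_irrelevant_tag) / sample_size

# Assumed rates for the weighted combination (see the lead-in).
capitalised_rate = 0.95
perfect_rate = 0.63       # all required tags and only those
complete_rate = 0.86      # all required tags, possibly extra ones

combined_perfect = (capitalised * capitalised_rate
                    + non_capitalised * perfect_rate) / total_unknown
combined_complete = (capitalised * capitalised_rate
                     + non_capitalised * complete_rate) / total_unknown

print(f"share of capitalised words: {capitalised / total_unknown:.3f}")
print(f"tags per word: {avg_tags_per_word:.2f}")
print(f"all required tags: {all_required:.3f}")
print(f"no irrelevant tag: {no_irrelevant:.3f}")
print(f"combined, perfectly tagged: {combined_perfect:.3f}")
print(f"combined, all required tags: {combined_complete:.3f}")
```

Under these assumptions the weighted averages round
to the 85 % and 92 % reported above.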

The test on the non-capitalised words was tough 
enough as we counted as irrelevant any tag that 
would be morphologically acceptable on general 
grounds, but which is not for a specific word. For 
instance, the misspelt word statisiticiens is tagged 
as [ADJ-PL NOUN-PL]; we count the ADJ-PL tag 
as irrelevant, on the ground that the underlying cor- 
rect word statisticiens is a noun only (compare with 
the adjective platoniciens). 

The same occurs with words ending in ement that 
are systematically tagged as [ADV NOUN-SG], un- 
less a longer ending like iquement is recognised. This 
often, but not always, makes the NOUN-SG tag ir- 
relevant. 

As for missing tags, more than half are adjective 
tags for words that are otherwise correctly tagged 
as nouns or past participles (which somewhat reduces 
the importance of the error, as the syntactic
distribution of adjectives overlaps with that of nouns and 
past participles). 

The remaining words that lack at least one tag 
include misspelt words belonging to closed classes 
(come, trés, vavec) or to irregular verbs (constuit), 
barbarisms resulting from the omission of blanks 
(proposde), or from the adjunction of superfluous 
blanks or hyphens (quand-même, so ciété). We also 
had a few examples of compound nouns improperly 
tagged as singular nouns, e.g. rencontres-tele, where 
the plural marking only appears on the first element 
of the compound. 

Finally, foreign words represent another class of 
problematic words, especially if they are not nouns. 
We found various English examples (at, born, of, 
enough, easy) but also Spanish, e.g. levantarse, and 
Italian ones, e.g. palazzi. 

5 Conclusion 

We have described the tagset, lexicon and guesser 
that we built for our French tagger. In this work, 
we re-used an existing lexicon. We composed this 
lexicon with finite-state transducers (mapping rules) 
in order to produce a new lexical transducer with the 
new tagset. The guesser for words that are not in the 
lexicon is described in more detail. Some test results 
are given. The disambiguation itself is described in 
(Chanod and Tapanainen, 1995). 